Chinese document layout analysis using an adaptive regrouping strategy

Authors

  • Fu Chang
  • Shih-Yu Chu
  • Chi-Yen Chen
Abstract

In document layout analysis, the defining conditions for textlines and text regions involve certain numerical parameters (e.g. inter-character spacing and inter-textline spacing) whose values can only be estimated when textlines and text regions have already been formed. This seemingly chicken-and-egg problem can be solved through an adaptive regrouping strategy, which consists of three operations. First, we group basic ingredients into preliminary textlines and text regions according to crude parametric values. Second, we refine our estimate of the parametric values based on the groups thus formed. Third, we form new groups by splitting and merging existing groups based on the newly estimated values. This paper applies the above strategy to Chinese documents whose complexity derives from the coexistence of horizontal and vertical textlines. Successful results are obtained using this approach. The accuracy rates for identifying text regions and textlines are above 98% in a test database consisting of over one thousand document samples and various layout structures.
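The three operations can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the interval representation of character boxes, the names `group_by_gap`, `estimate_spacing` and `adaptive_regroup`, the use of the median gap as the spacing estimate, the `factor` multiplier, and the repeat-until-stable loop are all assumptions made for illustration. In one dimension, regrouping with the refined threshold has the same effect as splitting and merging the existing groups.

```python
from statistics import median


def group_by_gap(boxes, max_gap):
    """Chain (start, end) intervals whose gap to the previous box is <= max_gap."""
    groups = []
    for box in sorted(boxes):
        if groups and box[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


def estimate_spacing(groups):
    """Median gap between neighbouring boxes inside the current groups."""
    gaps = [b[0] - a[1] for g in groups for a, b in zip(g, g[1:])]
    return median(gaps) if gaps else None


def adaptive_regroup(boxes, crude_gap, factor=2.0, max_rounds=5):
    # Operation 1: preliminary groups from a crude parameter value.
    groups = group_by_gap(boxes, crude_gap)
    for _ in range(max_rounds):
        # Operation 2: refine the spacing estimate from the groups just formed.
        spacing = estimate_spacing(groups)
        if spacing is None:
            break
        # Operation 3: split and merge by regrouping with the refined threshold.
        regrouped = group_by_gap(boxes, factor * spacing)
        if regrouped == groups:
            break
        groups = regrouped
    return groups
```

For example, six boxes of width 10 with gaps of 2, 2, 8, 2, 2 are all chained into one group by a crude threshold of 10; the refined estimate (median gap 2, threshold 4) then splits them at the gap of 8 into two groups of three.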

Similar resources

Distributed Autonomous Agents for Chinese Document Images Segmentation

In Chinese document image processing, text and/or graphical block detection serves as an essential step in document layout analysis that in turn permits the effective reasoning about the logical relationships among various text paragraphs and graphical entities for the purpose of document understanding. This paper presents a novel computational paradigm for extracting text/graphic blocks from Ch...

Based on Adaptive Split-and-Merge and Qualitative Spatial Reasoning

The ultimate goal of automatic document processing is to understand the semantics of a document. Towards such an end, one of the primary enabling steps has been to first reason about the layout of the document by means of page segmentation and segment spatial reasoning or labeling. This, in turn, allows for the derivation of document logical organization. This paper describes a generic document s...

Multi-view HAC for Semi-supervised Document Image Classification

This paper presents a semi-supervised document image classification system that aims to be integrated into a commercial document reading software. This system is asserted like an annotation help. From a set of unknown document images given by a human operator, the system computes regrouping hypothesis of same physical layout images and proposes them to the operator. Then he can correct them, va...

Word Spotting in Chinese Document Images without Layout Analysis

An approach to searching user-specified words/phrases in Chinese document images, without the requirements of layout analysis, is proposed in this paper. Bounding boxes of Chinese character images are first determined using connected component analysis. Next, a suitable character from the user-specified word/phrase is chosen as the initial character to search for a matching candidate in the doc...

Selective CRLA based Layout Analysis and Text Region Extraction from Low Quality Document Images

This paper aims at detecting textual regions by separating graphical regions using Selective CRLA scheme and statistical textual properties on noise infected and low resolution newspaper images. A Bottom Up approach is adopted (i.e.) Selective Constrained Run Length algorithm (CRLA) is applied to obtain the layouts and region growing method over it, segments the homogeneous regions. Statistical...

Journal:
  • Pattern Recognition

Volume 38  Issue 

Pages  -

Publication date 2005